{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "dcc9ef1a",
   "metadata": {},
   "source": [
    "**<font color = black size=6>实验六:随机森林</font>**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "ef6f7c39",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import math"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a6c19637",
   "metadata": {},
   "source": [
    "**<font color = blue size=4>第一部分:函数介绍</font>**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3302e25c",
   "metadata": {},
   "source": [
    "介绍一些可能会用到的函数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "9ba5e802",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[4 3 1 1 2 1 3 3 3 4]\n",
      "[[ 9 10 11]\n",
      " [ 3  4  5]\n",
      " [ 0  1  2]\n",
      " [12 13 14]\n",
      " [ 6  7  8]]\n",
      "   0  1  2\n",
      "1  1  1  1\n",
      "0  0  0  0\n",
      "4  4  4  4\n",
      "2  2  2  2\n",
      "3  3  3  3\n",
      "   0  1  2\n",
      "2  2  2  2\n",
      "0  0  0  0\n"
     ]
    }
   ],
   "source": [
    "# np.random.choice函数从一个一维数组中随机采样\n",
    "x = np.array([1,2,3,4])\n",
    "y = np.random.choice(x, replace=True, size=10)\n",
    "print(y)\n",
    "\n",
    "# np.random.shuffle函数对一个数组/矩阵按照第一维进行洗牌\n",
    "x = np.array([[0,1,2],[3,4,5],[6,7,8],[9,10,11],[12,13,14]])\n",
    "np.random.shuffle(x)\n",
    "print(x)\n",
    "\n",
    "# DataFrame对象的sample函数可以随机采样n个数据或者采样比例为frac的数据\n",
    "x = np.array([[0,0,0],[1,1,1],[2,2,2],[3,3,3],[4,4,4]])\n",
    "frame = pd.DataFrame(x)\n",
    "print(frame.sample(n=5))\n",
    "print(frame.sample(frac=0.3))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca268aef",
   "metadata": {},
   "source": [
    "**<font color = blue size=4>第二部分:实验任务</font>**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5365ec76",
   "metadata": {},
   "source": [
    "本次实验承接上次实验，实现随机森林。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7a96a745",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">本次实验依旧使用泰坦尼克号数据集(train_titanic.csv, test_titanic.csv。数据集包括了四个属性特征以及一个标签(即为Survived,代表是否生还),属性特征分别为Sex(性别)，sibsp(堂兄弟妹个数)，Parch(父母与小孩的个数)，Pclass(乘客等级)  \n",
    "其中该数据集无缺失值和异常值，且所有连续变量已自动转换为离散变量,标签(Survived)也自动转变为离散变量0和1，所以你无需进行数据预处理，可以直接使用该数据集。</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e967202a",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">1) 对上次实验的best_split函数进行修改，实现随机特征选择。  \n",
    "先从特征集$A$中先随机选取$k$个特征构成特征集$A'$，再从$A'$中选取最佳划分的特征。$k$一般取$max\\{log_2 d,1\\}$, $d$是$A$的元素的个数。你可使用特征的信息增益来决定最佳划分的特征。  \n",
    "    【输入】：数据集D、特征集A    \n",
    "    【输出】：随机特征集A'中最佳划分的特征维数   \n",
    "    【信息增益公式】:  \n",
    "        某数据集D有若干特征值以及对应的标签值，其总样本大小为|D|,这里取其中一个特征feature,该特征包含V个不同的取值，特征值为第v(v=1,2,...,V)个值的数量为|$D^v$|$(\\sum_{v=1}^VD^v=|D|)$,则该特征对应的信息增益为$$Gain(D,feature)=Ent(D)-\\sum_{v=1}^K \\frac{|D^v|}{D} Ent(D^v)$$  \n",
    "    【信息熵公式】:  \n",
    "        某数组包含K个不同的取值，样本为第k(k=1,2,...,K)个值的数量所占比例为p_k,则其信息熵为$$Ent=-\\sum_{k=1}^K p_k log_2 p_k$$\n",
    "</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "b55657f4",
   "metadata": {},
   "outputs": [],
   "source": [
    "def entropy(label):\n",
    "    unique_labels, label_counts = np.unique(label, return_counts=True)\n",
    "    total_samples = len(label)\n",
    "    ent = 0\n",
    "\n",
    "    for count in label_counts:\n",
    "        probability = count / total_samples\n",
    "        ent -= probability * np.log2(probability)\n",
    "\n",
    "    return ent\n",
    "\n",
    "def best_split(D, A):\n",
    "    d = len(A)\n",
    "    k = max(np.log2(d), 1)\n",
    "    A_prime = np.random.choice(list(A), int(k), replace=False)\n",
    "\n",
    "    def split_by_value(feature, label, value):\n",
    "        indices = np.where(feature == value)\n",
    "        split_label = label[indices]\n",
    "\n",
    "        return split_label\n",
    "\n",
    "    best_information_gain = 0\n",
    "    best_dimension = None\n",
    "\n",
    "    for dimension in A_prime:\n",
    "        feature_values = D[:, dimension]\n",
    "        unique_values = np.unique(feature_values)\n",
    "        information_gain = entropy(\n",
    "            D[:, -1]\n",
    "        )  # Initialize with the entropy of the whole dataset\n",
    "\n",
    "        for value in unique_values:\n",
    "            split_label = split_by_value(feature_values, D[:, -1], value)\n",
    "            information_gain -= (len(split_label) / len(D)) * entropy(split_label)\n",
    "\n",
    "        if information_gain > best_information_gain:\n",
    "            best_information_gain = information_gain\n",
    "            best_dimension = dimension\n",
    "\n",
    "    return best_dimension"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1aa003e",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">2) 对上次实验完成的决策树类进行修改。你需要实现下面三个函数：  \n",
    "1. TreeGenerate(self, D, A)：递归构建决策树，伪代码参照提供的“Algorithm 1 决策树学习基本算法”。  \n",
    "2. train(self, D)：做一些数据预处理，包括将Dataframe转换为numpy矩阵，从数据集中提取属性集，并调用TreeGenerate函数来递归地生成决策树。  \n",
    "3. predict(self, D)：对测试集D进行预测，要求返回数据集D的预测标签，即一个(|D|,1)矩阵（|D|行1列）。  \n",
    "由于训练集是采样生成，因此需要对predict函数做修改。需要考虑测试集中出现决策树无法划分的特征值时的情况。给出两种参考的做法：  \n",
    "a).对其不再进行预测，直接给定划分失败的样本标签(例如-1)。  \n",
    "b).跳过该划分节点，随机选取一个特征值继续遍历。</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "a275ce54",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 记下所有属性可能的取值\n",
    "train_frame = pd.read_csv('train_titanic.csv')\n",
    "D = np.array(train_frame)\n",
    "A = set(range(D.shape[1] - 1))\n",
    "possible_value = {}\n",
    "for every in A:\n",
    "    possible_value[every] = np.unique(D[:, every])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "d03dbc02",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 树结点类\n",
    "class Node:\n",
    "    def __init__(self, isLeaf=True, label=-1, index=-1):\n",
    "        self.isLeaf = isLeaf # isLeaf表示该结点是否是叶结点\n",
    "        self.label = label # label表示该叶结点的label（当结点为叶结点时有用）\n",
    "        self.index = index # index表示该分支结点的划分属性的序号（当结点为分支结点时有用）\n",
    "        self.children = {} # children表示该结点的所有孩子结点，dict类型，方便进行决策树的搜索\n",
    "        \n",
    "    def addNode(self, val, node):\n",
    "        self.children[val] = node #为当前结点增加一个划分属性的值为val的孩子结点"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "12e76c13",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 决策树类\n",
    "class DTree:\n",
    "    def __init__(self):\n",
    "        self.tree_root = None  # 决策树的根结点\n",
    "        self.possible_value = possible_value  # 用于存储每个特征可能的取值\n",
    "\n",
    "    \"\"\"\n",
    "    TreeGenerate函数用于递归构建决策树，伪代码参照课件中的“Algorithm 1 决策树学习基本算法”\n",
    "     \n",
    "    \"\"\"\n",
    "\n",
    "    def TreeGenerate(self, D, A):\n",
    "        # 生成结点 node\n",
    "        node = Node()\n",
    "\n",
    "        # if D中样本全属于同一类别C then\n",
    "        #     将node标记为C类叶结点并返回\n",
    "        # end if\n",
    "        if len(np.unique(D[:, -1])) == 1:\n",
    "            node.isLeaf = True\n",
    "            node.label = D[0, -1]\n",
    "            return node\n",
    "\n",
    "        # if A = Ø OR D中样本在A上取值相同 then\n",
    "        #     将node标记叶结点，其类别标记为D中样本数最多的类并返回\n",
    "        # end if\n",
    "        tmp = np.array(list(A))\n",
    "        if len(tmp) == 0 or len(np.unique(D[:, tmp])) == 1:\n",
    "            node.isLeaf = True\n",
    "            node.label = Counter(D[:, -1]).most_common(1)[0][0]\n",
    "            return node\n",
    "\n",
    "        # 从A中选择最优划分特征a_star\n",
    "        # （选择信息增益最大的特征，用到上面实现的best_split函数）\n",
    "        a_star = best_split(D, A)\n",
    "\n",
    "        # for a_star 的每一个值a_star_v do\n",
    "        #     为node 生成每一个分支；令D_v表示D中在a_star上取值为a_star_v的样本子集\n",
    "        #     if D_v 为空 then\n",
    "        #         将分支结点标记为叶结点，其类别标记为D中样本最多的类\n",
    "        #     else\n",
    "        #         以TreeGenerate(D_v,A-{a_star}) 为分支结点\n",
    "        #     end if\n",
    "        # end for\n",
    "        # print(\"a_star:\", a_star)\n",
    "        if a_star is not None:\n",
    "            for a_star_v in np.unique(D[:, a_star]):\n",
    "                D_v = D[D[:, a_star] == a_star_v]\n",
    "                if len(D_v) == 0:\n",
    "                    node.addNode(\n",
    "                        a_star_v, Node(True, Counter(D[:, -1]).most_common(1)[0][0])\n",
    "                    )\n",
    "                else:\n",
    "                    node.addNode(\n",
    "                        a_star_v,\n",
    "                        self.TreeGenerate(\n",
    "                            D[D[:, a_star] == a_star_v], A - {a_star}\n",
    "                        ),  # 递归调用TreeGenerate函数\n",
    "                    )\n",
    "\n",
    "        # def __init__(self, isLeaf=True, label=-1, feature_index=-1) node的构造函数中，isLeaf默认为True，feature_index默认为-1。而函数执行到这里时，node的值需要赋值为False，feature_index需要赋值为a_star\n",
    "            node.isLeaf = False\n",
    "            node.feature_index = a_star\n",
    "            return node\n",
    "        else:\n",
    "            node.isLeaf = True\n",
    "            node.label = Counter(D[:, -1]).most_common(1)[0][0]\n",
    "            return node\n",
    "\n",
    "    \"\"\"\n",
    "    train函数可以做一些数据预处理（比如Dataframe到numpy矩阵的转换，提取属性集等），并调用TreeGenerate函数来递归地生成决策树\n",
    " \n",
    "    \"\"\"\n",
    "\n",
    "    def train(self, D):\n",
    "        D = np.array(D)  # 将Dataframe对象转换为numpy矩阵（也可以不转，自行决定做法）\n",
    "        \n",
    "        A = set(range(D.shape[1] - 1))  # 特征集A\n",
    "        # print(A)\n",
    "\n",
    "        self.tree_root = self.TreeGenerate(D, A)  # 递归地生成决策树，并将决策树的根结点赋值给self.tree_root\n",
    "        return\n",
    "\n",
    "    \"\"\"\n",
    "    predict函数对测试集D进行预测， 并输出预测准确率（预测正确的个数 / 总数据数量）\n",
    "    \n",
    "    \"\"\"\n",
    "\n",
    "    def predict(self, D):\n",
    "        D = np.array(D)  # 将Dataframe对象转换为numpy矩阵（也可以不转，自行决定做法）\n",
    "        # 对于D中的每一行数据d，从当前结点x=self.tree_root开始，当当前结点x为分支结点时，\n",
    "        # 则搜索x的划分特征为该行数据相应的特征值的孩子结点（即x=x.children[d[x.index]]），不断重复，\n",
    "        # 直至搜索到叶结点，该叶结点的标签就是数据d的预测标签\n",
    "        label = []\n",
    "        for d in D:\n",
    "            x = self.tree_root\n",
    "            succeed = True\n",
    "            while not x.isLeaf:\n",
    "                # print(\"x.feature_index:\", x.feature_index)\n",
    "                if d[x.feature_index] not in x.children:\n",
    "                    succeed = False\n",
    "                    break\n",
    "                x = x.children[d[x.feature_index]]\n",
    "            # print(x.label)\n",
    "            if succeed:\n",
    "                label.append(x.label)\n",
    "            else:\n",
    "                # 对其不再进行预测，直接给定划分失败的样本标签为-1\n",
    "                label.append(-1)\n",
    "        return label"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42ff8753",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">3) 重复采用Bootstrap自助采样法对训练数据集'test_titanic.csv'进行采样，生成$n$个子训练数据集($n$自行设定)。  \n",
    "Bootstrap采样法是指，每次从原数据集中【有放回】地随机采样一个样本，重复进行$m$次，就生成一个有$m$个样本的子数据集。</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "b674f7a3",
   "metadata": {},
   "outputs": [
    {
     "ename": "NameError",
     "evalue": "name 'Counter' is not defined",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mNameError\u001b[0m                                 Traceback (most recent call last)",
      "Cell \u001b[1;32mIn[14], line 10\u001b[0m\n\u001b[0;32m      8\u001b[0m D \u001b[38;5;241m=\u001b[39m D[np\u001b[38;5;241m.\u001b[39mrandom\u001b[38;5;241m.\u001b[39mchoice(D\u001b[38;5;241m.\u001b[39mshape[\u001b[38;5;241m0\u001b[39m], D\u001b[38;5;241m.\u001b[39mshape[\u001b[38;5;241m0\u001b[39m], replace\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mTrue\u001b[39;00m)]\n\u001b[0;32m      9\u001b[0m D \u001b[38;5;241m=\u001b[39m pd\u001b[38;5;241m.\u001b[39mDataFrame(D)\n\u001b[1;32m---> 10\u001b[0m tree[i]\u001b[38;5;241m.\u001b[39mtrain(D)\n",
      "Cell \u001b[1;32mIn[13], line 81\u001b[0m, in \u001b[0;36mDTree.train\u001b[1;34m(self, D)\u001b[0m\n\u001b[0;32m     78\u001b[0m A \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mset\u001b[39m(\u001b[38;5;28mrange\u001b[39m(D\u001b[38;5;241m.\u001b[39mshape[\u001b[38;5;241m1\u001b[39m] \u001b[38;5;241m-\u001b[39m \u001b[38;5;241m1\u001b[39m))  \u001b[38;5;66;03m# 特征集A\u001b[39;00m\n\u001b[0;32m     79\u001b[0m \u001b[38;5;66;03m# print(A)\u001b[39;00m\n\u001b[1;32m---> 81\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mtree_root \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mTreeGenerate(D, A)  \u001b[38;5;66;03m# 递归地生成决策树，并将决策树的根结点赋值给self.tree_root\u001b[39;00m\n\u001b[0;32m     82\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m\n",
      "Cell \u001b[1;32mIn[13], line 56\u001b[0m, in \u001b[0;36mDTree.TreeGenerate\u001b[1;34m(self, D, A)\u001b[0m\n\u001b[0;32m     50\u001b[0m             node\u001b[38;5;241m.\u001b[39maddNode(\n\u001b[0;32m     51\u001b[0m                 a_star_v, Node(\u001b[38;5;28;01mTrue\u001b[39;00m, Counter(D[:, \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m1\u001b[39m])\u001b[38;5;241m.\u001b[39mmost_common(\u001b[38;5;241m1\u001b[39m)[\u001b[38;5;241m0\u001b[39m][\u001b[38;5;241m0\u001b[39m])\n\u001b[0;32m     52\u001b[0m             )\n\u001b[0;32m     53\u001b[0m         \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m     54\u001b[0m             node\u001b[38;5;241m.\u001b[39maddNode(\n\u001b[0;32m     55\u001b[0m                 a_star_v,\n\u001b[1;32m---> 56\u001b[0m                 \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mTreeGenerate(\n\u001b[0;32m     57\u001b[0m                     D[D[:, a_star] \u001b[38;5;241m==\u001b[39m a_star_v], A \u001b[38;5;241m-\u001b[39m {a_star}\n\u001b[0;32m     58\u001b[0m                 ),  \u001b[38;5;66;03m# 递归调用TreeGenerate函数\u001b[39;00m\n\u001b[0;32m     59\u001b[0m             )\n\u001b[0;32m     61\u001b[0m \u001b[38;5;66;03m# def __init__(self, isLeaf=True, label=-1, feature_index=-1) node的构造函数中，isLeaf默认为True，feature_index默认为-1。而函数执行到这里时，node的值需要赋值为False，feature_index需要赋值为a_star\u001b[39;00m\n\u001b[0;32m     62\u001b[0m     node\u001b[38;5;241m.\u001b[39misLeaf \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mFalse\u001b[39;00m\n",
      "Cell \u001b[1;32mIn[13], line 56\u001b[0m, in \u001b[0;36mDTree.TreeGenerate\u001b[1;34m(self, D, A)\u001b[0m\n\u001b[0;32m     50\u001b[0m             node\u001b[38;5;241m.\u001b[39maddNode(\n\u001b[0;32m     51\u001b[0m                 a_star_v, Node(\u001b[38;5;28;01mTrue\u001b[39;00m, Counter(D[:, \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m1\u001b[39m])\u001b[38;5;241m.\u001b[39mmost_common(\u001b[38;5;241m1\u001b[39m)[\u001b[38;5;241m0\u001b[39m][\u001b[38;5;241m0\u001b[39m])\n\u001b[0;32m     52\u001b[0m             )\n\u001b[0;32m     53\u001b[0m         \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m     54\u001b[0m             node\u001b[38;5;241m.\u001b[39maddNode(\n\u001b[0;32m     55\u001b[0m                 a_star_v,\n\u001b[1;32m---> 56\u001b[0m                 \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mTreeGenerate(\n\u001b[0;32m     57\u001b[0m                     D[D[:, a_star] \u001b[38;5;241m==\u001b[39m a_star_v], A \u001b[38;5;241m-\u001b[39m {a_star}\n\u001b[0;32m     58\u001b[0m                 ),  \u001b[38;5;66;03m# 递归调用TreeGenerate函数\u001b[39;00m\n\u001b[0;32m     59\u001b[0m             )\n\u001b[0;32m     61\u001b[0m \u001b[38;5;66;03m# def __init__(self, isLeaf=True, label=-1, feature_index=-1) node的构造函数中，isLeaf默认为True，feature_index默认为-1。而函数执行到这里时，node的值需要赋值为False，feature_index需要赋值为a_star\u001b[39;00m\n\u001b[0;32m     62\u001b[0m     node\u001b[38;5;241m.\u001b[39misLeaf \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mFalse\u001b[39;00m\n",
      "    \u001b[1;31m[... skipping similar frames: DTree.TreeGenerate at line 56 (1 times)]\u001b[0m\n",
      "Cell \u001b[1;32mIn[13], line 56\u001b[0m, in \u001b[0;36mDTree.TreeGenerate\u001b[1;34m(self, D, A)\u001b[0m\n\u001b[0;32m     50\u001b[0m             node\u001b[38;5;241m.\u001b[39maddNode(\n\u001b[0;32m     51\u001b[0m                 a_star_v, Node(\u001b[38;5;28;01mTrue\u001b[39;00m, Counter(D[:, \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m1\u001b[39m])\u001b[38;5;241m.\u001b[39mmost_common(\u001b[38;5;241m1\u001b[39m)[\u001b[38;5;241m0\u001b[39m][\u001b[38;5;241m0\u001b[39m])\n\u001b[0;32m     52\u001b[0m             )\n\u001b[0;32m     53\u001b[0m         \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m     54\u001b[0m             node\u001b[38;5;241m.\u001b[39maddNode(\n\u001b[0;32m     55\u001b[0m                 a_star_v,\n\u001b[1;32m---> 56\u001b[0m                 \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mTreeGenerate(\n\u001b[0;32m     57\u001b[0m                     D[D[:, a_star] \u001b[38;5;241m==\u001b[39m a_star_v], A \u001b[38;5;241m-\u001b[39m {a_star}\n\u001b[0;32m     58\u001b[0m                 ),  \u001b[38;5;66;03m# 递归调用TreeGenerate函数\u001b[39;00m\n\u001b[0;32m     59\u001b[0m             )\n\u001b[0;32m     61\u001b[0m \u001b[38;5;66;03m# def __init__(self, isLeaf=True, label=-1, feature_index=-1) node的构造函数中，isLeaf默认为True，feature_index默认为-1。而函数执行到这里时，node的值需要赋值为False，feature_index需要赋值为a_star\u001b[39;00m\n\u001b[0;32m     62\u001b[0m     node\u001b[38;5;241m.\u001b[39misLeaf \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mFalse\u001b[39;00m\n",
      "Cell \u001b[1;32mIn[13], line 30\u001b[0m, in \u001b[0;36mDTree.TreeGenerate\u001b[1;34m(self, D, A)\u001b[0m\n\u001b[0;32m     28\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(tmp) \u001b[38;5;241m==\u001b[39m \u001b[38;5;241m0\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(np\u001b[38;5;241m.\u001b[39munique(D[:, tmp])) \u001b[38;5;241m==\u001b[39m \u001b[38;5;241m1\u001b[39m:\n\u001b[0;32m     29\u001b[0m     node\u001b[38;5;241m.\u001b[39misLeaf \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mTrue\u001b[39;00m\n\u001b[1;32m---> 30\u001b[0m     node\u001b[38;5;241m.\u001b[39mlabel \u001b[38;5;241m=\u001b[39m Counter(D[:, \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m1\u001b[39m])\u001b[38;5;241m.\u001b[39mmost_common(\u001b[38;5;241m1\u001b[39m)[\u001b[38;5;241m0\u001b[39m][\u001b[38;5;241m0\u001b[39m]\n\u001b[0;32m     31\u001b[0m     \u001b[38;5;28;01mreturn\u001b[39;00m node\n\u001b[0;32m     33\u001b[0m \u001b[38;5;66;03m# 从A中选择最优划分特征a_star\u001b[39;00m\n\u001b[0;32m     34\u001b[0m \u001b[38;5;66;03m# （选择信息增益最大的特征，用到上面实现的best_split函数）\u001b[39;00m\n",
      "\u001b[1;31mNameError\u001b[0m: name 'Counter' is not defined"
     ]
    }
   ],
   "source": [
    "train_frame = pd.read_csv('train_titanic.csv')\n",
    "\n",
    "# Bootstrap 采样\n",
    "n = 10\n",
    "tree = [DTree()] * n\n",
    "for i in range(n):\n",
    "    D = np.array(train_frame)\n",
    "    D = D[np.random.choice(D.shape[0], D.shape[0], replace=True)]\n",
    "    D = pd.DataFrame(D)\n",
    "    tree[i].train(D)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bcd780da",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">4) 生成n棵决策树实例，使用上述生成的n个子训练数据集各自训练一棵决策树，即子训练集D1训练决策树1，子训练集D2训练决策树2……</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13d5c224",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ----- Your code here -------\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e7837970",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">5) 用训练完成的$n$棵决策树分别对测试数据集'test_titanic.csv'进行预测。采用相对多数投票法来对各棵决策树的预测结果进行结合。输出结合的预测结果的准确率。  \n",
    "【相对多数投票法】  \n",
    "对于某个样本$x$, 相对多数投票法预测它的标签为得票最多的标签。若同时有多个标签获得最高票，则从中随机选取一个。其公式如下所示：\n",
    "$$H(x)=C_{\\mathop{\\arg\\max}_{j} \\sum_{i=1}^n h_i^j(x)}$$  \n",
    "</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "c9896a22",
   "metadata": {},
   "outputs": [
    {
     "ename": "AttributeError",
     "evalue": "'NoneType' object has no attribute 'isLeaf'",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
      "Cell \u001b[1;32mIn[15], line 6\u001b[0m\n\u001b[0;32m      4\u001b[0m result \u001b[38;5;241m=\u001b[39m []\n\u001b[0;32m      5\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m t \u001b[38;5;129;01min\u001b[39;00m tree:\n\u001b[1;32m----> 6\u001b[0m     result\u001b[38;5;241m.\u001b[39mappend(t\u001b[38;5;241m.\u001b[39mpredict(test_frame))\n\u001b[0;32m      8\u001b[0m \u001b[38;5;66;03m# 相对多数投票\u001b[39;00m\n\u001b[0;32m      9\u001b[0m result \u001b[38;5;241m=\u001b[39m np\u001b[38;5;241m.\u001b[39marray(result)\n",
      "Cell \u001b[1;32mIn[13], line 98\u001b[0m, in \u001b[0;36mDTree.predict\u001b[1;34m(self, D)\u001b[0m\n\u001b[0;32m     96\u001b[0m x \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mtree_root\n\u001b[0;32m     97\u001b[0m succeed \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mTrue\u001b[39;00m\n\u001b[1;32m---> 98\u001b[0m \u001b[38;5;28;01mwhile\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m x\u001b[38;5;241m.\u001b[39misLeaf:\n\u001b[0;32m     99\u001b[0m     \u001b[38;5;66;03m# print(\"x.feature_index:\", x.feature_index)\u001b[39;00m\n\u001b[0;32m    100\u001b[0m     \u001b[38;5;28;01mif\u001b[39;00m d[x\u001b[38;5;241m.\u001b[39mfeature_index] \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m x\u001b[38;5;241m.\u001b[39mchildren:\n\u001b[0;32m    101\u001b[0m         succeed \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mFalse\u001b[39;00m\n",
      "\u001b[1;31mAttributeError\u001b[0m: 'NoneType' object has no attribute 'isLeaf'"
     ]
    }
   ],
   "source": [
    "test_frame = pd.read_csv('test_titanic.csv')\n",
    "\n",
    "# ----- Your code here -------\n",
    "result = []\n",
    "for t in tree:\n",
    "    result.append(t.predict(test_frame))\n",
    "\n",
    "# 相对多数投票\n",
    "result = np.array(result)\n",
    "res = []\n",
    "for i in range(len(result[0])):\n",
    "    res.append(Counter(result[:, i]).most_common(1)[0][0])\n",
    "\n",
    "# print(res)\n",
    "\n",
    "accuracy = np.sum(res == test_frame['Survived']) / len(test_frame)\n",
    "\n",
    "print(\"Bootstrap采样的准确率为：\", accuracy)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0f6f62e5",
   "metadata": {},
   "source": [
    "**<font color = blue size=4>第三部分:作业提交</font>**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "615aba6d",
   "metadata": {},
   "source": [
    "一、实验课下课前提交完成代码，如果下课前未完成，请将已经完成的部分进行提交，未完成的部分于之后的实验报告中进行补充  \n",
    "要求:  \n",
    "1)文件格式为：学号-姓名.ipynb  \n",
    "2)不要提交文件夹、压缩包、数据集等无关文件，只需提交单个ipynb文件即可"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b0a43fe8",
   "metadata": {},
   "source": [
    "二、实验报告提交截止日期为：【10月27日 14:20】  \n",
    "提交地址：https://send2me.cn/g_kfMtFI/SuiqyPO6B7rxqg  \n",
    "要求：  \n",
    "1)文件格式为：学号-姓名-实验六.pdf  \n",
    "2)【不要】提交文件夹、压缩包、代码文件、数据集等任何与实验报告无关的文件，只需要提交单个pdf文件即可  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26699b34",
   "metadata": {},
   "source": [
    "三、课堂课件获取地址: https://www.jianguoyun.com/p/DTGgCYAQp5WhChir26EFIAA  \n",
    "实验内容获取地址: https://www.jianguoyun.com/p/DekWAFoQp5WhChis26EFIAA"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
